On Automatic Information Extraction from Large Web Sites
نویسندگان
چکیده
Information extraction from Web sites is nowadays a relevant problem, usually performed by software modules called wrappers. A key requirement is that the wrapper generation process should be automated to the largest extent, in order to allow for large-scale extraction tasks even in presence of changes in the underlying sites. So far, however, only semi-automatic proposals have appeared in the literature. We present a novel approach to information extraction from Web sites, which reconciles recent proposals for supervised wrapper induction with the more traditional field of grammar inference. Grammar inference provides a promising theoretical framework for the study of unsupervised – i.e., fully automatic – wrapper generation algorithms. However, due to some unrealistic assumptions on the input, these algorithms are not practically applicable to Web information extraction tasks. The main contributions of the paper stand in the definition of a class of regular languages, called the prefix mark-up languages, that nicely abstract the structures usually found in HTML pages, and in the definition of a polynomial-time unsupervised learning algorithm for this class. The paper shows that, differently from other known classes, prefix mark-up languages and the associated algorithm can be practically used for information extraction purposes. A system based on the techniques described in the paper has been implemented in a working prototype. Experiments on known Web sites further demonstrate the practical applicability of the proposed approach.
منابع مشابه
Presenting a method for extracting structured domain-dependent information from Farsi Web pages
Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...
متن کاملAutomatic Web Information Extraction in the ROADRUNNER System
This paper presents roadRunner, a research project that aims at developing solutions for automatically extracting data from large HTML data sources. The target of our research are data-intensive Web sites, i.e., HTML-based sites with a fairly complex structure, that publish large amounts of data. The paper describes the top-level software architecture of the roadRunner System, and the novel res...
متن کاملAutomatically Generating Reports from Large Web Sites
Many large web sites contain highly valuable information. Their pages are dynamically generated by scripts which retrieve data from a back-end database and embed them in HTML templates. Based on this observation several techniques have been developed to automatically extract data from a set of structurally homogeneous pages. These tools represent a step towards the automatic extraction of data ...
متن کاملFunctionality-Based Web Image Categorization
The World Wide Web provides an increasingly powerful and popular publication mechanism. Web documents often contain a large number of images serving various different purposes. Identifying the functional categories of these images has important applications including information extraction, web mining, web page summarization and mobile access. This paper describes a study on the functional cate...
متن کاملEfficiently Locating Collections of Web Pages to Wrap
Many large web sites contain highly valuable information. Their pages are dynamically generated by scripts which retrieve data from a back-end database and embed them into HTML templates. Based on this observation several techniques have been developed to automatically extract data from a set of structurally homogeneous pages. These tools represent a step towards the automatic extraction of dat...
متن کامل